Size-Optimizing

This article introduces techniques to reduce the size of Assembler programs.

As you know, I have participated in the Pain Size Coding Compo. The source of my entry was published in issue 4/98 of the diskmag Pain, which was released in the middle of April, as well as the sources of the 22 other competitors. Just a source is nice, but the learning effect isn't high because you only see the final result but not how the whole thing was created. Luckily, I kept all older versions of my entry, from the original 96 bytes version to the final 61 bytes, so that I can use them now. Together with you, I want to explain step by step and byte by byte how I got that far.

[1] What is size-optimizing?

Size-optimizing means to reduce the size of a program as much as possible. In general, not only when it comes to size-optimizing, I distinguish two ways of optimizing, namely

(a) strategic optimisation: finding an optimal, in this case as small as possible algorithm

(b) tactical optimisation: optimisation in detail, e.g. by replacing opcodes by equivalent ones that require less memory

Strategic optimisation is the actual challenge on optimizing. Algorithms are so different to each other that no, or only very few general rules how to optimize algorithms exist. In a word: The strategic optimisation is different in every program, it is the coder's task.

Therefore the main - but not the only - topic of this article concerns the tools of the tactical optimisation. Here certain techniques exist, just mechanical tasks that, once you understand them, you can use them almost every program. Of course these techniques strongly depend on the CPU for which you write a program. As we are a PC-diskmag, we deal with 80x86-compatible CPUs.

[2] Why do I need size-optimizing?

A hint: 4-kbyte-intros. Here size-optimizing is very important, as you might have guessed. Apart from that, size-optimizing appears to have less practical use than speed-optimizing nowadays. After all, there is RAM en masse, there is no lack of storage media either, common PCs are distributed with hard disks of between two and four gigabytes; so what sense does optimizing the size of algorithms make? Perhaps even fighting against every single byte, as is the case in size-coding-competitions? I don't want to deal with the "artistic" aspects size-optimizing has. But a pro-argument certainly is that if you want to create a really small program you need to have a good knowledge of the CPU instructions. That's no different to speed-optimizing, but size-optimizing, of course, has different "criteria" than speed-optimizing. Therefore when you are size-optimizing you get to know other instructions and other sides of common instructions than on speed-optimizing, by which you gain new knowledge about Assembler, CPU and Hardware. That can certainly be only an advantage for you. And more, its fun.

Before I can introduce the original version of my program to you and we can begin with the step-by-step optimizing, it is necessary to know what this program is all about. In short words: I want to tell the task of the Pain Size Coding Compo short, but in enough detail. - We remember: The program, by which I want to show you some ideas for how to strip off some bytes from Assembly programs for 80x86- and compatible CPUs, was my entry for the Pain Size Coding Compo.

[3] Task

Shortest version: You have to write a program that fills a screen area ("block") that has been selected by the user in a colour that has also been selected by the user.

The program structure in detail:

After the start of the program it shall wait for two inputs.

At first the user has to input the number of the block that is supposed to be filled. This number has to be between 1 and 5. If the user enters zero or a number that is bigger than 5, the program shall quit. This value, the number of the to-be-filled block, is called nbb in the original rules. In contrast to that, I will call the memory address or the register, where this value will be stored temporarily, nbb.

The second value is the colour of the block. If the user inputs 0, the program shall quit. From now on this colour will be called clr, like in the original rules. After that, the program shall switch into mode 13h (320x200, 256 colours) and draw the selected block in the selected colour. Finally it has to wait for a keypress, return to mode 3 (text mode) and quit without crashing.

What do these blocks look like? The following ascii graphic shows it. It comes from the organizers of the compo, that means from Zaphod and Dake; I changed it a bit so that it's easier to understand.

                                   320 pixels
             I-----------------------------------------------I

             +-----------------------------------------------+  ---
             Ś                       |                       Ś   Ś
             Ś   BLOCK #1            |            BLOCK #2   Ś   Ś
             Ś                       |                       Ś   Ś
             Ś          .-------------------------.          Ś   Ś
             Ś          |                         |          Ś   Ś
             Ś__________|         BLOCK #5        |__________Ś   Ś 200 pixels
             Ś          |                         |          Ś   Ś
             Ś          |                         |          Ś   Ś
             Ś          `-------------------------'          Ś   Ś
             Ś                       |                       Ś   Ś
             Ś   BLOCK #3            |            BLOCK #4   Ś   Ś
             Ś                       |                       Ś   Ś
             +-----------------------------------------------+  ---

Which means:

 block   start position (x/y)      x-width      y-height     cut out part
 1       (0/0)                     160          100          bottom right
 2       (160/0)                   160          100          bottom left
 3       (0/100)                   160          100          top right
 4       (160/100)                 160          100          top left
 5       (80/50)                   160          100          nothing

You see that the first four blocks are very similar to each other. Always the part that is covered by block 5 is cut out. In contrast to that, block 5 isn't covered by any other block; but it is the block that covers the other blocks! Right now we can state that this block will have an important role in regards to the strategic optimisation of the program.

However, there are some important rules to worry about:

- The program must run on an 80486.

- Right after the start of the program, AX and BX are zero and the direction flag is cleared. The program can rely on these initial register states. It is true that there are also other registers that have a defined state after the start, e.g. DI, which usually is 0fffeh, or SI, which has the same value as IP, which means 100h in COM-files, but the program must not assume any of these states except AX==0, BX==0 and DF==0.

- On checking the entries only the keys 0..9 will be used for inputting nbb and clr. That means that you don't have to worry about what will happen if the user enters e.g. Enter for nbb, which would be an illegal input. Only 0 and the keys 6..9 have to be prohibited for nbb, and regarding clr only 0 is illegal.

- When the program waits for a keypress after drawing the block, the Space key will be used. This information might also give some ideas for optimization.

Those were all rules you have to know to understand the further explanations.

[4] Program structure

It's always a good start if you think about the program structure and maybe write it down as pseudocode or in a high level language program. What follows is a simple (Borland-)C-implementation of blocks.

#include <conio.h>
#include <stdlib.h>

//pointer to video-ram
char far * screenptr = (char far *) ((long)0xa000 << 16);

//putpixel-function
void pset13h(unsigned int x,unsigned int y,unsigned char col,char far *where)
{
  *(where+320*y+x)=col;
}


//draw a filled rectangle with the given co-ordinates and colour
void drawblock(unsigned int startx,unsigned int starty,unsigned char clr)
{
  unsigned char i,
                j;

  for(i=0;i<100;i++)
  {
    for(j=0;j<160;j++)
    {
      pset13h(startx+j,starty+i,clr,screenptr);
    }
  }
}

void main()
{
  unsigned char nbb,
                clr;
  unsigned int startx,
               starty;

  //get nbb - mustn't equal 0, mustn't be greater than 5
  nbb=getch();
  nbb-=48;
  if( (nbb==0) || (nbb>5) ) exit(0);

  //get clr - mustn't equal 0
  clr=getch();
  clr-=48;
  if(clr==0) exit(0);

  //switch to mode 13h
  asm {
    mov ax,0x13
    int 0x10
  }

  //draw which block?
  switch(nbb)
  {
    case 1: startx=0;
            starty=0;
            break;
    case 2: startx=160;
            starty=0;
            break;
    case 3: startx=0;
            starty=100;
            break;
    case 4: startx=160;
            starty=100;
            break;
    case 5: startx=80;
            starty=50;
  }

  //draw block
  drawblock(startx,starty,clr);

  //clear center if it's not block 5
  if(nbb!=5) drawblock(80,50,0);

  //wait for keypress
  getch();

  //return to mode 3
  asm {
    mov ax,3
    int 0x10
  }
}

Quickly done, yeeees... That is enough for the beginning. Now you have to start thinking.

(a) Keyboard input

Since getch() returns ASCII codes we have to subtract 48, the ASCII code of the character 1, from nbb and clr to get numeric values.

The input routine for clr comes right after nbb. Would a loop make more sense? A pro for a loop is the fact that the input routines for nbb and clr are very similar already in the C-code, a contra is that on inputting nbb, zero has to be prohibited as well as numbers higher than 5, whereas for clr only zero is an invalid input. But this problem can also be avoided. Look at the following variant:

  #define nbb value[0]
  #define clr value[1]
  unsigned char value[2],
                counter;

  [...]


  //get nbb and clr
  for(counter=0;counter<2;counter++)
  {
    value[counter]=getch()-48;
    switch(counter)
    {
      case 0:  if( value[counter]>5 ) exit(0);
      default: if( value[counter]==0) exit(0);
    }
  }

Good. The disadvantage is very obvious: A switch case is required. In Assembler the result would be a chain of CMP- and Jx-instructions. That's not the best thing to do. Let's start without a loop. - Later we will see that a loop, if it is well optimized, is the better solution nevertheless :)

(b) Mode 13h init (and return to mode 3)

Straightforward. At least in the C-version you cannot optimize it any more.

(c) Draw the block

Now it starts to become interesting. Firstly: Already in this early version an obviously strategic way of optimisation is exploited, namely the fact that block 5 covers all other blocks. Therefore the program first draws a filled rectangle with the length (160/100) and the x/y co-ordinates of the selected block as the start point; after that block 5 will be drawn in colour 0, by which the covered part gets erased automatically. The only exception is block 5 - here, nothing will be erased.

Secondly: The x/y co-ordinates of the blocks 1..4 are conspicuous; the x-axis can be either 0 or 160, the y-axis either 0 or 100. A possibility to take advantage of that would be to save only 0 or 1 for each axis. This value would have to be multiplied with 160 or 100 respectively to get the actual axis. The values 0 and 1 could be stored in a single bit, which means that the co-ordinates of the first four blocks could be stored in a single byte; by that one could strongly reduce the space for the co-ordinate-data. The disadvantage is the relatively complicated use of multiplication and boolean operators in Assembler - multiplication would have to be done using MUL or IMUL because neither 100 nor 160 is a power of 2, which makes using shift-instructions impossible. So the first value would have to be stored in the accumulator and the other in another register, and for the bit-instructions we would have to shift nbb, which would cost some bytes again.

So let's remain conservative at the beginning, calculate the start offsets of the blocks from the x/y co-ordinates and store them in our Assembly program with DW as a sort of lookup-table.

(d) Wait for keypress

In C also that is a thing where you don't have to think alot - in Assembler this can probably be shortened.

[5] Assembler implementation

We do everything similarly to the high-level language version, but we pay attention to Assembly-specific things; at first we write the code quickly, and later we can worry about details.

My first entry for the Pain Coding Compo looked like the following. The first implementation doesn't have to be better than this - 96 bytes.

;blocks.asm (c) 1998 by adok^hugi
;adok@blackbox.at / hugi.home.pages.de
;my entry for da pain coding compo

code segment
assume cs:code,ds:code,es:code
org 100h

start:
 ;prepare data pointer
 lea di,data

 ;get nbb
 cld
 int 16h
 sub al,49      ;al minus ascii code of 1 => overflow when 0 => saved one cmp
 cmp al,4
 ja end_prog
 mov bl,al
 shl bl,1

 ;get clr
 xor ah,ah
 int 16h
 sub al,48
 or al,al
 je end_prog
 mov dl,al

 ;switch into mode 13h
 mov ax,13h
 int 10h

 ;load es
 mov ax,0a000h
 mov es,ax

 ;read start position of the block
 add di,bx
 mov di,[di]

 ;draw block
 mov al,dl
 call drawblock

 ;if block isn't 5: clear center
 xor ax,ax
 mov di,16080
 cmp bl,8
 je start_label1
 call drawblock

start_label1:
 ;wait for keypress
 int 16h

end_prog:
 mov ax,3
 int 10h
 ret

 ;procedures
 drawblock proc near
  mov dh,100
  drawblock_label1:
    mov cx,160
    rep stosb
    add di,160
    dec dh
    or dh,dh
    jne drawblock_label1
  retn
 drawblock endp

 ;data
 data dw 0
      dw 160
      dw 32000
      dw 32160
      dw 16080

code ends
end start

That was indeed quite similar to the C-program. Therefore I shall name the differences, rather than the things that are common.

(a) Keyboard inputs

Function AH=0 of Interrupt 16h is used for inputting of nbb and clr as well as for waiting for a keypress at the end of the program. This BIOS-function makes the CPU wait - only Interrupts are allowed in the background - until a key was pressed. Its scancode will be returned in AH and the ASCII-code, if existing, in AL.

The enterred values are checked, and if they are illegal, the program will jump to the label end_prog at once, where the screen will be cleared by re-initializing mode 3 and the program will quit.

At the nbb check we find the first tactical optimisation: AL (which contains nbb at that instance) is not decreased by 48 but by 49. This will cause an underflow if the user enterred zero. Now instead of 0, AL contains 0ffh, and therefore cmp al,4 ja end_prog is enough to prohibit both inputs higher than 5 and inputs that are zero.

Attention: Now, of course, nbb is smaller than it ought to be: If it is 0, it indicates block 1 - and so on.

(b) Data structure

Not the x/y co-ordinates but the start offsets of the blocks are stored; moreover, they aren't passed to the drawblock procedure immediately but loaded from a table. With lea di,data, the offset of this table gets assigned to DI; BX, which contains the double of nbb due to SHL, gets added to DI, by which you get the offset of the selected block; with mov di,[di] this offset is passed to DI - the previous value of DI isn't needed anymore.

(c) Drawblock

The block is drawn using a loop containing rep stosb. After that DI gets the offset of block 5, unless nbb isn't equal to 8, which would indicate that block 5 was selected; this block is drawn in colour 0, by which the parts of the other blocks that are covered by block 5 are erased automatically.

(d) The end

Since AH is zero, we don't have to clear it again but can call Interrupt 16h at once. To return to mode 3, however, AH must be cleared again because its value was changed by Interrupt 16h. The program is exited using RET. That's a completely legal way unless the stack has been messed up.

Let's come to the actual topic of this article now.

[6] Optimisation

I've tried to order this chapter roughly the same way as I did the optimizing. This order isn't the optimum at all, but in this way we will see at least that tactical optimisations are nice and good but later can prove to be unnecessary because of strategic optimisations. In short words: First the structure has to be okay, then worry about the details!

Let's start by checking what can be optimized at all.

(a) The CLD call is simply superfluous because the rules say that the direction flag is set to zero after the start of the program automatically. we delete CLD and save a byte. Now our program's size is 95 bytes.

(b) lea di,data and add di,bx are superfluous as well. Why calculate in realtime when there is a better solution? We remove both instructions and replace mov di,[di] by mov di,data[bx]. This way we save three more bytes - 92 bytes.

(c) Again a superfluous instruction: or al,al is not necessary because the zero flag has already been set or cleared by SUB. There is a second OR that we don't need, namely in drawblock: or dh,dh. Remove it! 88 bytes.

(d) Now a bit more strategy. The start offset of block 5 appears twice in our program - what shall we do? Replacing mov di,16080 by a memory access is of no use. Instead, one can remove block 5 from the offset table and, by rearranging some instructions, have it displayed nevertheless. Replace the following code

         ;draw block
         mov al,dl
         call drawblock

         ;if block isn't 5: clear center
         xor ax,ax
         mov di,16080
         cmp bl,8
         je start_label1
         call drawblock

by:

         ;set block color
         mov al,dl

         ;block 5 check
         cmp bl,8
         je block5

         ;otherwise: read start position of the block
         mov di,data[bx]
         call drawblock

         ;if block isn't 5: clear center
         xor al,al
        block5:
         mov di,16080
         call drawblock

The section "load es" must be placed before "switch into mode 13h" to set AH to zero, otherwise it wouldn't work. Little effort - 86 bytes.

(e) Since CH is zero right after the start of the program it would be possible to save a byte by replacing mov cx,160 by mov cl,160; but that doesn't conform to the rules and therefore won't be done.

That was the first, spontaneous optimisation. Let's go on.

(f) We come to one of the best tactical optimisation helpers: XCHG. If one of the "to-be-exchanged" registers is AX, XCHG costs only a single byte. You can use that e.g. to clear registers, move values of AX which aren't needed in AX anymore to other registers or save them temporarily. Our current program already has three points where it is possible to use XCHG: the first, is the nbb input routine, where we can replace mov bl,al by xchg bx,ax xor bh,bh, by which we save xor ah,ah before the second Int 16h call; instead of mov dl,al we can write xchg dx,dx, and as CX is zero after exitting the procedure drawblock we can replace the xor ax,ax in the section "clear center" by xchg cx,ax. 83 bytes.

(g) DEC works with both byte and word registers, but using byte registers takes a byte more. However, that's exactly what we do in drawblock as we decrement DH. To save a byte you would have to use DL instead of DH. ... Well, does it work? Appearently not really because we forgot that DH isn't zero, which causes the loop to be repeated more often than planned! We need a register whose hibyte is zero at that moment, and this register is BX. So we replace mov dh,100 by mov bl,100 and dec dh by dec bx. Works! 82 bytes.

(h) The procedure drawblock uses the value 160 twice, which can be avoided by replacing the code between drawblock_label1 and dec bx by the following:

         drawblock_label1:
           mov cx,160
           add di,cx
           rep stosb
           dec bx

However, to get the desired result we have to subtract 160 from the offset data:

        block5:
         mov di,15920
         call drawblock

and

        ;data
        data dw 65376
             dw 0
             dw 31840
             dw 32000

Two bytes have been saved again. 80 bytes.

(i) How do we load ES? - Via AX. But why should be use a register when we also have the stack? By replacing the section "load es" by

        ;load es
        push 0a000h
        pop es

we save a byte since pop es has its own opcode. Really? No, the program's size increases by six bytes. The reason for that is simple - we forgot to activate 80286-instructions, which we do by writing

       .286

before the start of the code segment. Now we have the desired effect; that means, our program's size is 80 bytes. If you want you can move "load es" to the beginning of the program for cosmetic reasons.

(j) Another nice instruction is CBW. It extends AL to 16 bit by setting all bits of AH to the state of bit 7 of AH (the sign bit). That means: If we know that bit 7 of AL is zero in certain situations, we can clear AH with CBW and save a byte compared to xor ah,ah. This very trick can be used now; the section "get clr" ends with

        xchg bx,ax
        xor bh,bh

We exchange the order of the XCHG and XOR instuctions, and so we get

        xor ah,ah
        xchg bx,ax

And we replace xor ah,ah by:

cbw

78 bytes.

(k) In (e) I said that CH is zero after the start of the program; so we would be able to replace mov cx,160 by mov cl,160 and save a byte if the rules didn't say that the initial state of CX was undefined. Nevertheless it's possible to kill this byte if we don't save clr in DX but in CX.

        ;get clr
        int 16h
        sub al,48
        je end_prog
        xchg cx,ax

Now, of course, we have to adjust the section "set block color".

        ;set block color
        xchg ax,cx
        cbw

The last thing to do is to replace CX by CL in drawblock. 77 bytes.

(l) You do not necessarily need to use CMP to compare values - common arithmetic operations set the flags as well; don't forget that CMP is just a variant of SUB. Why, therefore, do we check if nbb is a legal by first subtracting 49 and then testing if the result is higher than 4? Why don't we simply subtract 53 from nbb? Let's do that. Instead of

        sub al,49
        cmp al,4

we get:

        sub al,53

Of course we must not forget that we have to compute with other nbb-values now than we are used. That means that we have to re-write all routines that have something to do with nbb.

Multiplying nbb with two makes no sense at this place - so let's remove the instruction shl bl,1. The check whether block 5 was selected is wrong, too; we replace it by:

        ;block 5 check
        add bl,4
        ja block5

Quite a clever thing, because it doesn't only cause nbb to be tested, but it will also increment it by 4. This value just has to be doubled, which is the reason we insert shl bl,1 before mov di,data[bx].

If we start the program now, we will realize that this wasn't all. We have forgotten one thing: due to sub al,53 bit 7 of AL will always be set after legal inputs - and because of that CBW doesn't work as planned because it sets all bits of AH, which means that AH gets the value 0ffh instead of 0. Since Interrupt 16h never returns a value in AL whose highest bit is set (except of extended scancodes, which, according to the rules, won't be used anyway), this problem can be solved by cutting the cbw after ja end_prog and inserting it after the first Int 16h call. So the section "get nbb" has to look like the following:

        ;get nbb
        int 16h
        cbw
        sub al,53
        ja end_prog
        xchg bx,ax

That was all - 75 bytes.

(m) With a dirty trick one can get to 74 bytes, namely by removing the RET after switching to the text mode. This way the procedure drawblock will be run again but as we are in the text mode writing to the segment 0a000h has no apparent consequence. Nevertheless I advise you against using this method because it might cause trouble, just think of protected mode.

(n) And away with that dirt! It's better not to call drawblock as a procedure but to insert it directly into the main program after mov di,data[bx]. But we need some kind of loop for that. Let's define a label before the section "set block color". I called it draw. Now after the drawblock-code you would have to test whether block 5 has been drawn, that means nbb is 4, and, if the answer is no, jump back to the label draw. AX wouldn't have to be set to zero because at that point CX is zero anyway and since the values of AX and CX get exchanged after the label draw AX will be set to zero automatically. The problem rather is that also BX, in which nbb is temporarily stored, is 0 at that moment, so the value of nbb has been lost.

What can be done against that? Either you replace BX in the drawblock-loop by DX, which doesn't solve the problem how to write the break condition of the big loop, or you change more things, possibly losing some previous tactical optimizations. I chose the latter thing. A possible solution looks like the following:

The offset of block 5 is in the data table again, like the offsets of all other blocks.

        ;data
        data dw 65376
             dw 0
             dw 31840
             dw 32000
             dw 15920

sub al,53 in the section "get nbb" gets replaced by:

        sub al,49
        cmp al,4

In the section "block 5 check" we don't jump at once if nbb is 4 but we save the result of the test with pushf. This way the label block5 becomes superfluous, too.

        ;block 5 check
        cmp bl,4
        pushf

This test result will be evaluated in "clear center" after setting BL to 4 (block 5). If block 5 hasn't been drawn yet, which is the case if the zero flag is set after popf, the execution will be resumed at the label draw. This way block 5 will be drawn in zero, which means that the part of the selected block that is covered by block 5 gets erased.

        ;if block isn't 5: clear center
        mov bl,4
        popf
        jne draw

Three less bytes - 72 bytes.

At this point it might be good to show the complete source of the latest version - here it is.

.286
code segment
assume cs:code,ds:code
org 100h

start:
 ;load es
 push 0a000h
 pop es

 ;get nbb
 int 16h
 cbw
 sub al,49
 cmp al,4
 ja end_prog
 xchg bx,ax

 ;get clr
 int 16h
 sub al,48
 je end_prog
 xchg cx,ax

 ;switch into mode 13h
 mov ax,13h
 int 10h

draw:
 ;set block color
 xchg cx,ax
 cbw

 ;block 5? save test result
 cmp bl,4
 pushf

 ;read start position of the block
 shl bl,1
 mov di,data[bx]

 ;draw the block
 mov bl,100
drawblock_label1:
 mov cl,160
 add di,cx
 rep stosb
 dec bx
 jne drawblock_label1

 ;if block isn't 5: clear center
 mov bl,4
 popf
 jne draw

 ;wait for keypress
 int 16h

end_prog:
 mov ax,3
 int 10h
 ret

 ;data
 data dw 65376
      dw 0
      dw 31840
      dw 32000
      dw 15920

code ends
end start

(o) Testing whether block 5 was selected and saving the result for later evaluation isn't the best solution; it would be better to do the test at the point where it's needed. As I've already said, this is impossible because the value of BX a.k.a. nbb gets lost in the drawblock loop. But what about saving the value of BX on the stack or in a register? If we take a look at "read start position of the block", SI appears to be the ideal register for this task as BX and SI can be combined for indirect memory accesses. Instead of doubling BX you copy the value of BX to SI and set DI to the content of data[bx+si] - this costs the same memory space.

        ;read start position of the block
        mov si,bx
        mov di,data[bx+si]

Now, of course, "clear center" has to be changed, which is simple:

        ;block 5? yes: clear center
        mov bl,4
        cmp si,bx
        jne draw

Since we don't need "block 5 check" any more and therefore can - and even must - remove it, we save three bytes. Now the program has 69 bytes.

(p) The sections "get nbb" and "get clr" have an almost identical structure. Let's build a loop!

        ;get nbb & clr
       get_nbb:
        cmp al,5
        ja end_prog
        xchg ax,bx
        int 16h
        cbw
        sub al,48
        je end_prog
        or bl,bl
        je get_nbb
        xchg cx,ax

Now we have to write mov di,data-2[bx+si] instead of mov di,data[bx+si] and replace mov bl,4 by mov bl,5 in "clear center" because nbb has its original value again.

And what use does that have? At the moment nothing except the fact that the cbw for setting the block colour has become superfluous - 68 bytes.

(q) JCXZ is a very short instruction to check whether CX is zero. We don't need that in our program but we can use JCXZ nevertheless. Our problem is that, according to the rules, the xchg cx,ax at the end of the "get nbb & clr" section can overwrite AX with a value different than zero. Therefore AX and not just AL has to be set to 13h to initialize mode 13h. However, we know that BH is zero after the start of the program, which doesn't change in the whole program. Another thing we know is that

        or bl,bl
        je get_nbb

doesn't eat more memory than:

        mov bx,cx
        jcxz get_nbb

So if we replace the current solution by the one above, it is guaranteed that after xchg cx,ax AH is zero, and we can replace mov ax,13h by mov al,13h, which brings us to 67 bytes.

And what about this variant?

        ;get nbb & clr
        cmp al,5
        ja end_prog
        mov cx,ax
        xchg bx,ax
        int 16h
        cbw
        sub al,48
        je end_prog
        jcxz start
        xchg cx,ax

Instead of BH, AH provides that CH gets cleared, and as a side effect CX contains nbb after the end of the loop - maybe the basis of a later optimisation.

(r) If it was possible to replace

        ;if block isn't 5: clear center
        mov bl,5
        cmp si,bx
        jne draw

        ;if block isn't 5: clear center
        dec si
        jns draw

three bytes would be saved. This method would work if the offset of block 5 was located at the address ds:data-2. But that's not the case. What we have to do is to invert the order of the offset data. Which means:

        ;data
        data dw 15920
             dw 32000
             dw 31840
             dw 0
             dw 65376

Now we have to make sure that BX (+ SI) points to the correct offset after inputting nbb. A possible solution: insert the command neg bx after "get nbb & clr" and add 6 to BX. But this way the advantage of dec si jns draw would get lost. It's better to embed the conversion into the loop like this:

        ;get nbb & clr
        mov bl,5
        sub bl,al
        js end_prog
        mov cx,ax
        int 16h
        cbw
        sub al,48
        je end_prog
        jcxz start
        xchg cx,ax

How good that there is a rule saying that only the keys 0..9 will be used for checking the entries! Yes, this code doesn't prevent other illegal inputs such as Enter any more. Moreover, the line mov di,data-2[bx+si] has to be changed to mov di,data[bx+si] again. 65 bytes.

(s) Why should offsets always be used? Did you notice that each of the block offsets can be divided by 16? That means what we could use block segments instead by dividing the block offsets by 16 and adding 0a000h (40960d).

        ;data
        data dw 41955
             dw 42960
             dw 42950
             dw 40960
             dw 40950

We don't need "load es" any longer but

        mov es,data[bx+si]
        xor di,di

instead of:

        mov di,data[bx+si]

The result gives a size of 63 bytes.

(t) The final trick is one of the most interesting ones. In short words: We want to stop using BX as the counter of the loop, since that can also be done with DI.

We know that each block has a size of 160x100 pixels, which means that the area is 32000 pixels (the covered parts included). If we don't set DI to zero but to -32000 (= neg 32000 or 65536-32000=33536) in "read start position", drawblock can be rewritten like follows:

        ;draw the block
       drawblock_label1:
        mov cl,160
        add di,cx
        rep stosb
        js drawblock_label1

However, the values in the data table aren't correct now. If we divide 33636 bytes by 16, we get 2096. This value has to be subtracted from the block segment, by which we get:

        ;data
        data dw 39859
             dw 40864
             dw 40854
             dw 38864
             dw 38854

61 bytes.

That was all! That should be enough for the beginning... You find the original version and the - slightly changed - final version among the other sources.

- adok^hugi